1. Data analysis of the "Movies Dataset from Pirated Sites" dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm, trange
1.1 Read the data into a DataFrame and show its data types
movie_data = pd.read_csv('/content/data/MyDrive/data/movies_dataset.csv')
print(movie_data.head(5))
print(movie_data.dtypes)
Unnamed: 0 IMDb-rating appropriate_for director downloads id \
0 0 4.8 R John Swab 304 372092
1 1 6.4 TV-PG Paul Ziller 73 372091
2 2 5.2 R Ben Wheatley 1,427 343381
3 3 8.1 NaN Venky Atluri 1,549 372090
4 4 4.6 NaN Shaji Kailas 657 372089
industry language posted_date release_date run_time \
0 Hollywood / English English 20 Feb, 2023 Jan 28 2023 105
1 Hollywood / English English 20 Feb, 2023 Feb 05 2023 84
2 Hollywood / English English,Hindi 20 Apr, 2021 Jun 18 2021 1h 47min
3 Tollywood Hindi 20 Feb, 2023 Feb 17 2023 139
4 Tollywood Hindi 20 Feb, 2023 Jan 26 2023 122
storyline \
0 Doc\r\n facilitates a fragile truce between th...
1 Caterer\r\n Goldy Berry reunites with detectiv...
2 As the world searches for a cure to a disastro...
3 The life of a young man and his struggles agai...
4 A man named Kalidas gets stranded due to the p...
title views writer
0 Little Dixie 2,794 John Swab
1 Grilling Season: A Curious Caterer Mystery 1,002 John Christian Plummer
2 In the Earth 14,419 Ben Wheatley
3 Vaathi 4,878 Venky Atluri
4 Alone 2,438 Rajesh Jayaraman
Unnamed: 0 int64
IMDb-rating float64
appropriate_for object
director object
downloads object
id int64
industry object
language object
posted_date object
release_date object
run_time object
storyline object
title object
views object
writer object
dtype: object
From the data types and preview above, "id", "appropriate_for", "language", and "industry" are nominal attributes, while "IMDb-rating" and "downloads" are numeric attributes.
1.2 Preprocess selected attributes
The "Unnamed: 0" column is irrelevant to the analysis and can be dropped.
movie_data = movie_data.drop(['Unnamed: 0'],axis = 1)
Convert "run_time", "downloads", and "views" to float.
# Convert downloads and views: strip thousands separators, then cast to float
for col in ('downloads', 'views'):
    movie_data[col] = movie_data[col].str.replace(',', '')
    movie_data[col] = movie_data[col].astype('float')
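Equivalently, `pd.to_numeric` with `errors='coerce'` handles both the thousands separators and missing cells in one pass; a minimal sketch on made-up values:

```python
import pandas as pd

# Thousands-separated strings like those in 'downloads'/'views'; None models a missing cell
s = pd.Series(['2,794', '1,002', None])
nums = pd.to_numeric(s.str.replace(',', '', regex=False), errors='coerce')
print(nums)
```

Missing entries survive `str.replace` as NaN and stay NaN after conversion, so no separate null handling is needed.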
# Convert run_time to float minutes (handles '1h 47min', '2h', '84min', and bare numbers)
new = []
for item in movie_data['run_time']:
    s = str(item)
    if 'h' in s and ('min' in s or 'm' in s):
        s = s.strip('min')
        hours, mins = s.split('h ')
        new.append(float(hours) * 60 + float(mins))
    elif 'h' in s:
        new.append(float(s.strip('h')) * 60)
    elif 'min' in s or 'm' in s:
        new.append(float(s.strip('min')))
    else:
        new.append(float(item))
movie_data['run_time'] = new
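The branch-by-branch parsing above can also be sketched with a single regular expression; `parse_runtime` here is a hypothetical helper, not part of the original notebook:

```python
import re

def parse_runtime(item):
    """Convert '1h 47min', '2h', '84min', or a bare number of minutes to float minutes."""
    s = str(item).strip()
    m = re.fullmatch(r'(?:(\d+(?:\.\d+)?)h)?\s*(?:(\d+(?:\.\d+)?)\s*m(?:in)?)?', s)
    if m and (m.group(1) or m.group(2)):
        return float(m.group(1) or 0) * 60 + float(m.group(2) or 0)
    try:
        return float(s)          # plain numbers like '105'
    except ValueError:
        return float('nan')      # anything unparsable becomes NaN

print(parse_runtime('1h 47min'))   # 107.0
```

This could be applied with `movie_data['run_time'].map(parse_runtime)` and degrades gracefully to NaN on unexpected formats.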
Convert "posted_date" and "release_date" to datetime.
movie_data['posted_date'] = pd.to_datetime(movie_data['posted_date'])
movie_data['release_date'] = pd.to_datetime(movie_data['release_date'])
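The date columns mix formats such as "20 Feb, 2023" and "Jan 28 2023"; parsing element-wise with `errors='coerce'` turns unparsable cells into NaT instead of raising. A sketch on invented values:

```python
import pandas as pd

s = pd.Series(['20 Feb, 2023', 'Jan 28 2023', 'not a date'])
# Per-element parsing tolerates mixed formats; coerce maps failures to NaT
parsed = s.apply(lambda x: pd.to_datetime(x, errors='coerce'))
print(parsed)
```

Coerced NaT values can then be counted or filtered explicitly rather than silently defaulting.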
1.3 Data summary
(1) First, check whether any movies are duplicated and deduplicate, keeping the maximum value of each numeric attribute.
print(movie_data['id'].value_counts())
372090 402
371744 402
371877 402
372092 202
371991 202
...
303381 1
303380 1
303379 1
303377 1
30459 1
Name: id, Length: 17086, dtype: int64
non_unique_id_mov_data = movie_data.groupby('id').agg(id_count=('id', 'count')).query('id_count > 1').sort_values(by='id_count', ascending=False).index.to_list()
for id in non_unique_id_mov_data:
    non_unique_id_df = movie_data.query('id == @id').groupby('id').agg(max_downloads=('downloads', 'max'), max_views=('views', 'max'))
    max_downloads = non_unique_id_df['max_downloads'].iloc[0]
    max_views = non_unique_id_df['max_views'].iloc[0]
    movie_data.loc[movie_data['id'] == id, 'downloads'] = max_downloads
    movie_data.loc[movie_data['id'] == id, 'views'] = max_views
movie_data = movie_data.drop_duplicates()
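The per-id loop above can be collapsed into a single `groupby`/`transform`; a minimal sketch on invented rows (column names taken from the dataset, values made up):

```python
import pandas as pd

# Toy frame: id 1 appears twice with differing downloads/views
df = pd.DataFrame({
    'id':        [1, 1, 2],
    'downloads': [10.0, 30.0, 5.0],
    'views':     [100.0, 80.0, 50.0],
})

# Overwrite both columns with the per-id maximum, so duplicate rows become identical,
# then drop the exact duplicates
df[['downloads', 'views']] = df.groupby('id')[['downloads', 'views']].transform('max')
df = df.drop_duplicates().reset_index(drop=True)
print(df)
```

This avoids the per-id `query` and scales better than the explicit loop.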
(2) Next, summarize the nominal attributes: print the frequency of every distinct value and plot bar charts of the top 20.
nominal = ['appropriate_for','director','industry','language','writer','title','run_time','posted_date','release_date']
Merge "Unrated" into "Not Rated" in "appropriate_for":
movie_data['appropriate_for'] = movie_data['appropriate_for'].replace('Unrated', 'Not Rated')
Then draw the bar charts:
def bar_plot(data, feature_list):
    for col in feature_list:
        print(data[col].value_counts())
        v = list(data[col].value_counts().values)
        id = list(data[col].value_counts().index)
        bar = plt.barh(id[:20], v[:20])
        plt.yticks(rotation=45, size=7)
        plt.xticks(rotation=45, size=7)
        plt.bar_label(bar, v[:20])
        plt.show()
bar_plot(movie_data,nominal[:-2])
R                 3752
Not Rated         2264
PG-13             1763
PG                 884
TV-MA              406
TV-14              293
G                  140
TV-PG              115
TV-G                99
TV-Y7               44
TV-Y                25
Approved             9
NC-17                4
TV-Y7-FV             3
Passed               3
MA-17                1
TV-13                1
Drama                1
Drama, Romance       1
18+                  1
Name: appropriate_for, dtype: int64
David Dhawan 32
Ram Gopal Varma 27
Vikram Bhatt 25
Steven Spielberg 24
Priyadarshan 22
..
Mike Gan 1
Santhosh P. Jayakumar 1
Johnny Mitchell 1
Vijay Yelakanti 1
Biren Nag 1
Name: director, Length: 9672, dtype: int64
Hollywood / English    12197
Bollywood / Indian      2420
Anime / Kids            1028
Tollywood                769
Punjabi                  331
Stage shows              124
Pakistani                 92
Wrestling                 78
Dub / Dual Audio          45
3D Movies                  1
Name: industry, dtype: int64
English 10052
Hindi 1938
English,Spanish 388
Punjabi 309
Telugu 297
...
English,Spanish,Armenian 1
Bosnian,English,Croatian,German,Serbian 1
English,Korean,Spanish 1
Norwegian,Swedish 1
Spanish,German,English 1
Name: language, Length: 1168, dtype: int64
Naresh Kathuria 11
Dheeraj Rattan 11
Andrew Jones 11
Jagdeep Singh 10
Tyler Perry 10
..
Jack Ketchum 1
Kraig Wenman, Peter Sullivan 1
Robert Bruzio 1
Lynn Shelton, Michael Patrick O'Brien 1
Khwaja Ahmad Abbas, Khwaja Ahmad Abbas 1
Name: writer, Length: 13603, dtype: int64
Pinocchio 5
Alone 5
True Justice 5
Sacrifice 5
Blood Money 4
..
The Professor 1
A Violent Separation 1
We Have Always Lived in the Castle 1
Valentine 1
Madhumati 1
Name: title, Length: 16572, dtype: int64
90.0 851
95.0 470
92.0 447
93.0 427
85.0 418
...
51.0 1
26.0 1
28.0 1
32.0 1
271.0 1
Name: run_time, Length: 200, dtype: int64
Count the individual languages (and their frequencies) inside the "language" attribute:
# Split comma-separated language lists and tally each individual language
mov_language_dict = {}
for entry in movie_data['language'].astype('str'):
    for lang in entry.split(','):
        lang = lang.lstrip(' ')
        mov_language_dict[lang] = mov_language_dict.get(lang, 0) + 1
mov_language_dict = dict(sorted(mov_language_dict.items(), key=lambda x: x[1], reverse=True))
print(mov_language_dict)
{'English': 13055, 'Hindi': 2638, 'Spanish': 775, 'French': 643, 'nan': 534, 'Telugu': 412, 'Punjabi': 402, 'German': 372, 'Russian': 323, 'Tamil': 319, 'Italian': 251, 'Japanese': 212, 'Mandarin': 192, 'Arabic': 162, 'Urdu': 160, 'Latin': 109, 'Cantonese': 97, 'Chinese': 97, 'Marathi': 87, 'Korean': 80, 'Portuguese': 80, 'Kannada': 73, 'Malayalam': 73, 'Ukrainian': 70, 'Hebrew': 60, 'Thai': 54, 'Dutch': 46, 'Greek': 44, 'Polish': 43, 'Norwegian': 42, 'Swedish': 39, 'Danish': 36, 'Turkish': 36, 'Persian': 35, 'American Sign Language': 35, 'Bengali': 35, 'Romanian': 31, 'Vietnamese': 28, 'English Hindi': 28, 'Hungarian': 27, 'Serbian': 26, 'Indonesian': 26, 'Czech': 24, 'Panjabi': 22, 'Afrikaans': 20, 'Gujarati': 15, 'Icelandic': 15, 'Bulgarian': 13, 'Aboriginal': 13, 'Finnish': 12, 'Gaelic': 12, 'Irish Gaelic': 11, 'Welsh': 11, 'Xhosa': 11, 'Sanskrit': 11, 'Yiddish': 11, 'Pashtu': 11, 'Swahili': 10, 'Rajasthani': 10, 'Filipino': 9, 'None': 9, 'Bhojpuri': 9, 'North American Indian': 8, 'Hawaiian': 8, 'Somali': 8, 'Tagalog': 8, 'Armenian': 8, 'Irish': 8, 'Maori': 7, 'Haryanvi': 7, 'Malay': 7, 'Serbo-Croatian': 7, 'Central Khmer': 7, 'Swiss German': 6, 'Zulu': 6, 'Croatian': 6, 'Albanian': 6, 'Inuktitut': 6, 'Sign Languages': 6, 'Bosnian': 6, 'Esperanto': 6, 'Min Nan': 5, 'Nepali': 5, 'Estonian': 5, 'Tibetan': 5, 'Yoruba': 4, 'Norse': 4, 'Old': 4, 'Ancient (to 1453)': 4, 'Dari': 4, 'Mongolian': 4, 'Lithuanian': 4, 'Scottish Gaelic': 4, 'Old English': 4, 'Flemish': 4, 'Egyptian (Ancient)': 4, 'Scots': 4, 'Sindhi': 4, 'Kazakh': 3, 'Kashmiri': 3, 'Pushto': 3, 'Burmese': 3, 'Kurdish': 3, 'Saami': 3, 'Lingala': 3, 'Catalan': 3, 'Klingon': 3, 'Shanghainese': 3, 'Sindarin': 3, 'Akan': 2, 'Maltese': 2, 'Samoan': 2, 'Cree': 2, 'Tupi': 2, 'Macedonian': 2, 'Georgian': 2, 'Amharic': 2, 'Micmac': 2, 'Tswana': 2, 'Basque': 2, 'Quechua': 2, 'Navajo': 2, 'Ibo': 2, 'Nyanja': 2, 'Sotho': 2, 'Sioux': 2, 'Papiamento': 2, 'Slovenian': 2, 'Oriya': 2, 'Lao': 2, 'Southern Sotho': 2, 
'Aramaic': 2, 'Brazilian Sign Language': 2, 'Tulu': 2, 'Maya': 2, 'Khmer': 2, 'Romany': 2, 'Awadhi': 2, 'Bambara': 1, 'Sumerian': 1, 'Ojibwa': 1, 'Chechen': 1, 'Shoshoni': 1, 'Nahuatl': 1, 'Bable': 1, 'Tonga (Tonga Islands)': 1, 'Spanish Sign Language': 1, 'Wolof': 1, 'Dinka': 1, 'Chaozhou': 1, 'Russian Sign Language': 1, 'Parsee': 1, 'Visayan': 1, 'Rhaetian': 1, 'Turkmen': 1, 'Middle English': 1, 'p': 1, 'British Sign Language': 1, 'Igbo': 1, 'Hokkien': 1, 'Nama': 1, 'Polynesian': 1, 'Cheyenne': 1, 'Uzbek': 1, 'Mohawk': 1, 'Kikuyu': 1, 'Azerbaijani': 1, 'Assamese': 1, 'Abkhazian': 1, 'Pawnee': 1, 'Australian Sign Language': 1, 'Konkani': 1, 'Syriac': 1, 'Greenlandic': 1, 'Indian Sign Language': 1, 'English.Hindi': 1, 'Quenya': 1, 'Sinhala': 1, 'Tok Pisin': 1, 'Sicilian': 1, 'Slovak': 1, 'Acholi': 1, 'Apache languages': 1, 'Low German': 1, 'Kru': 1, 'Mende': 1, 'Algonquin': 1}
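The same per-language tally can be done without explicit loops using `str.split` plus `explode`; a sketch on a toy Series (real input would be `movie_data['language']`):

```python
import pandas as pd

s = pd.Series(['English,Hindi', 'English', 'Hindi, Spanish'])
counts = (s.str.split(',')      # list of languages per row
           .explode()           # one row per language
           .str.lstrip()        # drop the space after commas
           .value_counts())
print(counts)
```

`value_counts` returns the languages already sorted by frequency, so no manual dict sorting is needed.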
x = list(mov_language_dict.keys())[:20]
y = list(mov_language_dict.values())[:20]
bar = plt.barh(x,y)
plt.yticks(rotation = 45, size = 7)
plt.xticks(rotation = 45, size = 7)
plt.bar_label(bar, y)
plt.show()
The distributions of the release dates and platform posting dates are shown as histograms.
print(movie_data['release_date'].value_counts())
movie_data['release_date'].hist()
plt.title('release_date distribution')
plt.show()
print(movie_data['posted_date'].value_counts())
movie_data['posted_date'].hist()
plt.title('posted_date distribution')
plt.show()
1970-01-01 958
2022-10-07 29
2020-02-14 28
2019-12-06 28
2015-10-16 27
...
1999-07-16 1
1994-07-15 1
1997-09-26 1
1993-09-17 1
1958-03-28 1
Name: release_date, Length: 4886, dtype: int64
1970-01-01 38
2014-01-03 24
2014-09-11 21
2013-02-25 20
2014-02-01 16
..
2009-05-26 1
2009-05-25 1
2009-05-18 1
2009-05-10 1
2010-03-27 1
Name: posted_date, Length: 4123, dtype: int64
(3) Next, give the five-number summary and missing-value count of each numeric attribute, and show the distributions and outliers with histograms.
numeric = ['views','downloads','IMDb-rating']
# Five-number summary via describe()
print(movie_data[numeric].describe().loc[['max', '75%', '50%', '25%', 'min']])
         views  downloads  IMDb-rating
max  1638533.0   391272.0          9.9
75%    43700.0    12224.0          6.5
50%    19207.0     3757.0          5.7
25%     9899.0     1287.0          4.7
min      667.0        0.0          1.1
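For reference, the same five numbers come straight from `Series.quantile`; a toy sketch:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
# quantiles 0/0.25/0.5/0.75/1 are exactly min, Q1, median, Q3, max
five = s.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five)
```

This avoids slicing rows out of `describe()` when only the five-number summary is wanted.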
# Histograms of the numeric attributes
def hist_plot(data, feature_list):
    for col in feature_list:
        data[col].hist(bins=12, histtype="bar", alpha=0.5)
        plt.title(f'Distribution of {col}')
        plt.show()
hist_plot(movie_data,numeric)
1.4 Handling missing values
# Count missing values and show the missing ratio per column
data_null = movie_data.isnull().sum(axis=0)
print(data_null.T)
rate = data_null/len(movie_data)
print(rate)
IMDb-rating         477
appropriate_for    7277
director           1566
downloads             1
id                    0
industry              1
language            534
posted_date           1
release_date          1
run_time           1000
storyline          1322
title                 1
views                 1
writer             1820
dtype: int64
IMDb-rating        0.027918
appropriate_for    0.425904
director           0.091654
downloads          0.000059
id                 0.000000
industry           0.000059
language           0.031254
posted_date        0.000059
release_date       0.000059
run_time           0.058527
storyline          0.077373
title              0.000059
views              0.000059
writer             0.106520
dtype: float64
These results show that the appropriate_for attribute is missing for 42.6% of the records, nearly half the data; storyline, run_time, director, and writer also have sizeable gaps, at 7.7%, 5.9%, 9.2%, and 10.7% respectively.
Handle the missing values with several different methods:
(1) Drop the missing rows: delete every row that contains any null value
drop_data = movie_data.dropna(axis=0)
print(f"Rows before dropping rows with nulls: {len(movie_data)}")
print(f"Rows after dropping rows with nulls: {len(drop_data)}")
print(f"Retention rate: {len(drop_data)/len(movie_data)*100}%")
Rows before dropping rows with nulls: 17086
Rows after dropping rows with nulls: 8650
Retention rate: 50.62624370829919%
These results show that dropping every row with a null roughly halves the dataset; the earlier analysis tells us most of the dropped rows are missing only "appropriate_for", so this treatment is clearly unreasonable.
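A gentler variant restricts `dropna` to the columns whose missingness actually matters, via its `subset` parameter; a sketch on invented values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'IMDb-rating':     [4.8, np.nan, 8.1],
    'appropriate_for': [np.nan, 'R', np.nan],
})
# Drop rows missing 'IMDb-rating' but tolerate a missing 'appropriate_for'
kept = df.dropna(subset=['IMDb-rating'])
print(kept)
```

With `subset`, rows that only lack the sparse `appropriate_for` column survive, which would raise the retention rate considerably here.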
Bar charts of selected attributes before and after the deletion illustrate the difference:
def process_show(old_data, new_data, feature):
    feature_count = pd.DataFrame(old_data[feature].value_counts()).rename(columns={feature: 'feature_count'}).sort_values(by='feature_count', ascending=True)
    feature_count_cleaned = feature_count.copy()
    feature_count_cleaned['feature_count_cleaned'] = 0
    for level in list(feature_count.index):
        if level in list(new_data[feature].value_counts().index):
            feature_count_cleaned.loc[[level], ['feature_count_cleaned']] = new_data[feature].value_counts().loc[[level]].values[0]
    plt.figure(figsize=(20, 20))
    bar1 = plt.barh([d + 0.42 for d in range(len(feature_count_cleaned))], tick_label=feature_count_cleaned.index, width=feature_count_cleaned['feature_count'], label='feature_count', height=0.4)
    bar2 = plt.barh(list(range(len(feature_count_cleaned))), tick_label=feature_count_cleaned.index, width=feature_count_cleaned['feature_count_cleaned'], label='feature_count_cleaned', height=0.4)
    plt.yticks(rotation=45, fontsize=20)
    plt.xticks(rotation=45, fontsize=20)
    plt.bar_label(bar1, feature_count_cleaned['feature_count'], fontsize=16)
    plt.bar_label(bar2, feature_count_cleaned['feature_count_cleaned'], fontsize=16)
    plt.title(f"Comparison of {feature}", fontsize=24)
    plt.legend(fontsize=24, loc='lower right')
    plt.show()
process_show(movie_data,drop_data,'appropriate_for')
(2) Fill missing values with the most frequent value
def fill_highest_fre(data, feature_list):
    fill_map = {}
    for col in feature_list:
        top = data[col].value_counts().index
        fill_map[col] = list(top)[0]
    new_data = data.fillna(fill_map)
    return new_data
fill_data = fill_highest_fre(movie_data,movie_data.columns)
For a nominal attribute, filling with the most frequent value does not reshape the distribution apart from inflating that value's frequency; again taking appropriate_for as the example:
process_show(movie_data,fill_data,'appropriate_for')
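Per column, the lookup inside `fill_highest_fre` is equivalent to filling with `Series.mode`; a sketch on toy values:

```python
import pandas as pd

s = pd.Series(['R', 'R', 'PG', None])
# mode() ignores NaN and returns the most frequent value(s); take the first
filled = s.fillna(s.mode().iloc[0])
print(filled)
```

`mode()` also makes ties explicit (it returns all tied values), which `value_counts().index[0]` silently hides.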
Box plots of the numeric attributes compare the two treatments, dropping nulls versus most-frequent filling:
def box_compare(data1, data2, feature_list):
    for col in feature_list:
        l1 = list(data1[col])
        l2 = list(data2[col])
        labels = [f'{col} of drop_data', f'{col} of fill_data']
        plt.grid(True)
        plt.boxplot([l1, l2],
                    medianprops={'color': 'red', 'linewidth': 1.5},
                    meanline=True,
                    showmeans=True,
                    meanprops={'color': 'blue', 'ls': '--', 'linewidth': 1.5},
                    flierprops={"marker": "o", "markerfacecolor": "red", "markersize": 7},
                    labels=labels)
        plt.show()
box_compare(drop_data,fill_data,numeric)
(3) Fill missing values using correlations between attributes
For appropriate_for, the attribute with the highest missing rate, we try filling the missing values based on its correlation with the other attributes.
First encode the values of appropriate_for, compute Spearman correlation coefficients, and draw a heatmap.
code_data = pd.get_dummies(movie_data, columns=['appropriate_for'], dummy_na=True, drop_first=True)
plt.figure(figsize=(40, 40))
sns.heatmap(code_data.corr(method='spearman', numeric_only=True), cmap='YlGnBu', annot=True)
plt.title('Correlation Analysis')
Text(0.5, 1.0, 'Correlation Analysis')
The heatmap shows that appropriate_for is only weakly correlated with the other attributes (no coefficient above 0.3). IMDb-rating is the most correlated of them, so we use IMDb-rating for the imputation; given the low coefficients, though, we can already suspect that correlation-based filling of appropriate_for will not work well.
def feature_corr_fill(data, fill_feature, corr_feature):
    # Mean of corr_feature for each level of fill_feature
    mean_df = data.groupby(fill_feature).agg(avg=(corr_feature, 'mean'))
    corr_data = data.copy(deep=True)
    col_idx = corr_data.columns.get_loc(fill_feature)
    for i in trange(len(corr_data)):
        if pd.isna(corr_data[fill_feature].iloc[i]):
            rate = corr_data[corr_feature].iloc[i]
            # Choose the level whose mean is closest to this row's value
            dists = []
            for j in range(len(mean_df)):
                dists.append(abs(mean_df.iloc[j]['avg'] - rate))
            idx = np.argsort(np.array(dists))[0]
            corr_data.iloc[i, col_idx] = mean_df.index[idx]
    return corr_data
corr_data = feature_corr_fill(movie_data,'appropriate_for','IMDb-rating')
100%|██████████| 17086/17086 [00:08<00:00, 2033.19it/s]
Bar charts before and after the fill:
process_show(movie_data,corr_data,'appropriate_for')
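The row loop in `feature_corr_fill` can also be vectorized with NumPy broadcasting; a sketch on an invented frame whose class means are R→5.0 and PG→7.0:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'appropriate_for': ['R', 'PG', None, None],
    'IMDb-rating':     [5.0, 7.0, 6.9, 5.1],
})
means = df.groupby('appropriate_for')['IMDb-rating'].mean()
missing = df['appropriate_for'].isna()
# |rating - class mean| for every missing row x every class; take the closest class
dist = np.abs(df.loc[missing, 'IMDb-rating'].to_numpy()[:, None] - means.to_numpy()[None, :])
df.loc[missing, 'appropriate_for'] = means.index.to_numpy()[dist.argmin(axis=1)]
print(df)
```

The broadcasted distance matrix replaces the inner Python loop over levels, and `.loc` assignment on the boolean mask avoids chained-assignment warnings.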
(4) Fill missing values using similarity between data objects
Again targeting appropriate_for, the attribute with the highest missing rate, we try filling based on similarity between records: first normalize the numeric attributes (the regularit helper); then, for each incomplete record, find the most similar complete record by distance over the numeric attributes and copy its appropriate_for value.
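A minimal sketch of min-max normalization onto [0, 1] in isolation, on invented values (real input would be `movie_data[numeric]`):

```python
import pandas as pd

df = pd.DataFrame({'views': [0.0, 50.0, 100.0], 'downloads': [10.0, 20.0, 30.0]})
# Column-wise (x - min) / (max - min): each column's minimum maps to 0, maximum to 1
norm = (df - df.min()) / (df.max() - df.min())
print(norm)
```

Keeping all attributes on the same [0, 1] scale stops large-range columns like views from dominating the distance.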
def sample_corr_fill(data, fill_feature, feature_list):
    def regularit(data, feature_list):
        # Min-max normalization of the numeric columns onto [0, 1]
        new_df = pd.DataFrame(index=data.index)
        for col in feature_list:
            d = data[col]
            MAX = d.max()
            MIN = d.min()
            new_df[col] = (d - MIN) / (MAX - MIN)
        return new_df
    def eucliDist(X, Y):
        return np.sqrt(sum(np.power((X - Y), 2)))
    def manhDist(X, Y):
        return np.sum(abs(X - Y))
    reg_data = regularit(data, feature_list)
    print("Show regularized data:")
    print(reg_data.head(20))
    no_nan_data = pd.concat([reg_data, data[fill_feature]], axis=1).dropna(axis=0, inplace=False)
    nan_data = pd.concat([reg_data, data[fill_feature]], axis=1)
    nan_data = nan_data[nan_data[fill_feature].isnull()]
    sample_corr_data = data.copy(deep=True)
    nan_index = nan_data.index
    no_nan_index = no_nan_data.index
    for id_nan in tqdm(nan_index, total=len(nan_data), desc='Calculating similarity...'):
        dists = {}
        for id_sample in no_nan_index:
            dist = manhDist(np.array(reg_data.loc[id_nan]), np.array(reg_data.loc[id_sample]))
            dists[id_sample] = dist
        order = sorted(dists.items(), key=lambda x: x[1], reverse=False)
        idx = order[0][0]
        # .loc with (row, column) avoids chained-assignment pitfalls
        sample_corr_data.loc[id_nan, fill_feature] = sample_corr_data.loc[idx, fill_feature]
    return sample_corr_data
sample_corr_data = sample_corr_fill(movie_data.iloc[0:3000],'appropriate_for',numeric)
Show regularized data:
views downloads IMDb-rating
0 -0.005675 -0.001581 -0.725490
1 -0.000501 -0.000082 -1.514286
2 -0.020988 -0.005414 -0.872340
3 -0.010916 -0.009198 -3.888889
4 -0.002654 -0.002375 -0.660377
5 -0.003409 -0.002726 -0.955556
6 -0.017198 -0.021116 NaN
7 -0.026953 -0.006818 -1.588235
8 -0.009423 -0.001593 -1.933333
9 -0.013702 -0.007549 -0.543860
10 -0.005941 -0.002710 -1.378378
11 -0.056824 -0.054143 -8.777778
12 -0.007228 -0.008695 NaN
13 -0.061454 -0.061598 -1.666667
14 -0.008122 -0.001648 -2.142857
15 -0.232568 -0.193860 -1.588235
16 -0.018732 -0.010820 NaN
17 -0.058540 -0.051867 -1.514286
18 -0.000000 -0.000466 NaN
19 -0.017051 -0.005533 -0.692308
Calculating similarity...: 100%|██████████| 1643/1643 [05:12<00:00, 5.25it/s]
Given the size of the data and the long running time, we only fill and display the first 3000 rows:
process_show(movie_data.iloc[0:3000],sample_corr_data,'appropriate_for')
2. Data analysis of the "GitHub Dataset" dataset
2.1 Read the data into a DataFrame and show its data types
github_data = pd.read_csv('/content/data/MyDrive/data/github_dataset.csv')
print(github_data.head(5))
print(github_data.dtypes)
                repositories  stars_count  forks_count  issues_count  \
0        octocat/Hello-World            0            0           612
1  EddieHubCommunity/support          271          150           536
2             ethereum/aleth            0            0           313
3      localstack/localstack            0            0           290
4        education/classroom            0          589           202

   pull_requests  contributors language
0            316             2      NaN
1              6            71      NaN
2             27           154      C++
3             30           434   Python
4             22            67     Ruby
repositories     object
stars_count       int64
forks_count       int64
issues_count      int64
pull_requests     int64
contributors      int64
language         object
dtype: object
From the data types and preview above, "repositories" and "language" are nominal attributes; "stars_count", "forks_count", "issues_count", "pull_requests", and "contributors" are numeric attributes.
2.2 Data summary
(1) First, check whether the data contains duplicates and deduplicate.
print(github_data['repositories'].value_counts())
kameshsampath/ansible-role-rosa-demos 2
aloisdeniel/bluff 2
antoniaandreou/github-slideshow 2
jgthms/bulma-start 2
artkirienko/hlds-docker-dproto 2
..
WhiteHouse/CIOmanagement 1
0xCaso/defillama-telegram-bot 1
ethereum/blake2b-py 1
openfoodfacts/folksonomy_mobile_experiment 1
gamemann/All_PropHealth 1
Name: repositories, Length: 972, dtype: int64
github_data = github_data.drop_duplicates()
print(github_data['repositories'].value_counts())
octocat/Hello-World 1
ethereum/cable 1
feross/blob-to-buffer 1
studiomohawk/js-testing-boilerplates 1
studiomohawk/jekyll-theme-the_minimum 1
..
AnkushMalaker/speech-emotion-recognition 1
mikenikles/ghost-v3-google-cloud-storage 1
yourcelf/afgexplorer 1
trailofbits/circuitous-benchmarks 1
gamemann/All_PropHealth 1
Name: repositories, Length: 972, dtype: int64
(2) Next, summarize the nominal attributes: print the frequency of every distinct value and plot bar charts of the top 20.
The "repositories" attribute is processed first: each value such as "octocat/Hello-World" is split on "/" into the author name "octocat" and the project name "Hello-World", which are stored in github_data as two new nominal attributes, "author" and "project".
authors = []
projects = []
for item in github_data['repositories']:
    au, pro = item.split('/')
    authors.append(au)
    projects.append(pro)
github_data['author'] = authors
github_data['project'] = projects
print(github_data.head(5))
                repositories  stars_count  forks_count  issues_count  \
0        octocat/Hello-World            0            0           612
1  EddieHubCommunity/support          271          150           536
2             ethereum/aleth            0            0           313
3      localstack/localstack            0            0           290
4        education/classroom            0          589           202

   pull_requests  contributors language             author      project
0            316             2      NaN            octocat  Hello-World
1              6            71      NaN  EddieHubCommunity      support
2             27           154      C++           ethereum        aleth
3             30           434   Python         localstack   localstack
4             22            67     Ruby          education    classroom
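The split loop above has a vectorized equivalent via `str.split(expand=True)`; a sketch on two invented repository names:

```python
import pandas as pd

repos = pd.Series(['octocat/Hello-World', 'ethereum/aleth'])
# n=1 splits only on the first '/', in case a project name ever contains one
split = repos.str.split('/', n=1, expand=True)
authors, projects = split[0], split[1]
print(authors.tolist(), projects.tolist())
```

With `n=1` the split is robust to stray slashes, whereas the bare `item.split('/')` unpacking would raise on them.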
nominal = ['author','project','language']
bar_plot(github_data,nominal)
EddieHubCommunity 21
openfoodfacts 20
ethereum 18
bradtraversy 16
google 16
..
Orbiter 1
motasimmakki 1
dylanbeattie 1
AnandBaburajan 1
gamemann 1
Name: author, Length: 296, dtype: int64
github-slideshow 11
dotfiles 5
.github 3
blog 3
chat 2
..
booklee 1
DiscrimFarmer 1
bext 1
deposit-verifier 1
All_PropHealth 1
Name: project, Length: 946, dtype: int64
JavaScript          237
Python              151
HTML                 65
Java                 43
TypeScript           34
CSS                  30
Dart                 30
C++                  28
Jupyter Notebook     28
Ruby                 24
Shell                24
C                    23
PHP                  16
Go                   15
Rust                 10
Swift                10
Objective-C           8
Kotlin                7
C#                    7
SCSS                  4
Makefile              4
Jinja                 3
Dockerfile            3
Solidity              3
CoffeeScript          3
AutoHotkey            3
Hack                  2
CodeQL                2
PowerShell            2
Vue                   2
Assembly              2
Vim Script            2
Perl                  2
Elixir                2
Gherkin               1
Cuda                  1
QMake                 1
Pawn                  1
CMake                 1
Oz                    1
QML                   1
ActionScript          1
Roff                  1
HCL                   1
R                     1
PureBasic             1
Smarty                1
Less                  1
Svelte                1
Haskell               1
SourcePawn            1
Name: language, dtype: int64
(3) Next, give the five-number summary and missing-value count of each numeric attribute, and show the distributions and outliers with histograms and box plots.
numeric = ['stars_count','forks_count','issues_count','pull_requests','contributors']
print(github_data[numeric].describe().loc[['max', '75%', '50%', '25%', 'min']])
     stars_count  forks_count  issues_count  pull_requests  contributors
max        995.0        973.0         612.0          567.0         658.0
75%         65.0         37.0           6.0            2.0           4.0
50%         12.0          6.0           2.0            0.0           2.0
25%          1.0          1.0           1.0            0.0           0.0
min          0.0          0.0           1.0            0.0           0.0
hist_plot(github_data,numeric)
def box_plot(data, feature_list):
    l = []
    for col in feature_list:
        l.append(list(data[col]))
    plt.grid(True)
    plt.boxplot(l,
                medianprops={'color': 'red', 'linewidth': 1.5},
                meanline=True,
                showmeans=True,
                meanprops={'color': 'blue', 'ls': '--', 'linewidth': 1.5},
                flierprops={"marker": "o", "markerfacecolor": "red", "markersize": 7},
                labels=feature_list)
    plt.show()
box_plot(github_data,numeric)
2.3 Handling missing values
# Count missing values and show the missing ratio per column
data_null = github_data.isnull().sum(axis=0)
print(data_null.T)
rate = data_null/len(github_data)
print(rate)
repositories       0
stars_count        0
forks_count        0
issues_count       0
pull_requests      0
contributors       0
language         126
author             0
project            0
dtype: int64
repositories     0.00000
stars_count      0.00000
forks_count      0.00000
issues_count     0.00000
pull_requests    0.00000
contributors     0.00000
language         0.12963
author           0.00000
project          0.00000
dtype: float64
These results show that only the "language" attribute has missing values; we now fill them with the same four methods:
(1) Drop the missing rows: delete every row that contains any null value
drop_data = github_data.dropna(axis=0)
print(f"Rows before dropping rows with nulls: {len(github_data)}")
print(f"Rows after dropping rows with nulls: {len(drop_data)}")
print(f"Retention rate: {len(drop_data)/len(github_data)*100}%")
Rows before dropping rows with nulls: 972
Rows after dropping rows with nulls: 846
Retention rate: 87.03703703703704%
This dataset is small and has few nulls, so dropping them still keeps about 87% of the rows, which is an acceptable outcome.
(2) Fill missing values with the most frequent value
fill_data = fill_highest_fre(github_data,['language'])
process_show(github_data, fill_data, 'language')
(3) Fill missing values using correlations between attributes
A correlation heatmap shows that "language" is only weakly correlated with the numeric attributes, so correlation-based filling is unlikely to work well here; we nevertheless try filling the missing "language" values from the "forks_count" attribute.
feature_corr_data = feature_corr_fill(github_data,'language','forks_count')
process_show(github_data,feature_corr_data,'language')
100%|██████████| 972/972 [00:00<00:00, 3661.19it/s]
(4) Fill missing values using similarity between data objects
sample_corr_data = sample_corr_fill(github_data,'language',numeric)
Show regularized data:
stars_count forks_count issues_count pull_requests contributors
0 -0.000000 -0.000000 inf -1.258964 -0.003049
1 -0.374309 -0.182260 -7.039474 -0.010695 -0.120954
2 -0.000000 -0.000000 -1.043478 -0.050000 -0.305556
3 -0.000000 -0.000000 -0.897516 -0.055866 -1.937500
4 -0.000000 -1.533854 -0.490244 -0.040367 -0.113367
5 -0.000000 -0.000000 -0.388636 -0.000000 -0.004580
6 -0.000000 -0.000000 -0.363839 -0.406948 -0.211786
7 -0.000000 -0.000000 -0.325380 -0.389706 -1.366906
9 -24.512821 -0.422515 -0.300000 -0.029038 -0.394068
10 -0.004036 -0.000000 -0.234343 -0.000000 -0.000000
11 -0.000000 -0.000000 -0.229376 -0.017953 -0.329293
12 -0.305774 -0.255484 -0.207510 -0.032787 -0.000000
13 -0.000000 -0.000000 -0.159393 -0.010695 -0.345603
14 -0.000000 -0.001029 -0.150659 -0.005319 -0.000000
15 -0.005051 -0.002060 -0.150659 -0.053903 -0.388186
16 -0.000000 -1.390663 -0.137803 -0.010695 -0.010753
17 -0.000000 -0.569355 -0.117002 -0.029038 -0.039494
18 -0.000000 -56.235294 -0.112933 -0.040367 inf
19 -0.000000 -0.210199 -0.108893 -0.010695 -0.124786
20 -0.000000 -0.666096 -0.104882 -0.010695 -0.128645
Calculating similarity...: 100%|██████████| 126/126 [00:16<00:00, 7.66it/s]
process_show(github_data,sample_corr_data,'language')